Preliminary

Global Theme

# from model project.
global_theme <- theme_few() + # Theme based on S. Few's "Practical Rules for Using Color in Charts"
                  theme(plot.title = element_text(color = "darkred")) +
                  theme(strip.text.x = element_text(size = 14, colour = "#202020")) +
                  theme(plot.margin=margin(10,30,10,30))

Introduction

This analysis will explore the Wine Quality Dataset.

The winequality.names text file shows information about the data. This is summarised below.

Attributes Information

  1. fixed acidity (tartaric acid - \(\frac{g}{dm^3}\))
  2. volatile acidity (acetic acid - \(\frac{g}{dm^3}\))
  3. citric acid (\(\frac{g}{dm^3}\))
  4. residual sugar (\(\frac{g}{dm^3}\))
  5. chlorides (sodium chloride - \(\frac{g}{dm^3}\))
  6. free sulfur dioxide (\(\frac{mg}{dm^3}\))
  7. total sulfur dioxide (\(\frac{mg}{dm^3}\))
  8. density (\(\frac{g}{cm^3}\))
  9. pH
  10. sulphates (potassium sulphate - \(\frac{g}{dm^3}\))
  11. alcohol (% by volume)
  12. good (1|0)
  13. color (red|white)
  14. quality (score between 0 - 10)

We aim to discover relationships in the data and form models to get a better understanding of the variables of interest. Specifically, we will use classification and clustering methods to develop insights.

1. Classification

  • We will construct a number of binary classification models in order to predict wine color from the wine's chemistry.
  • We will also construct a linear regression model to predict the wine quality.

2. Clustering

  • We will use a Hierarchical Clustering method as an unsupervised learning technique to try to identify relationships between observations in our data.
  • Specifically, we wish to identify whether wine quality/color is related to the alcohol or volatile acidity content.

0. Data Import / Exploration

This data is publicly available from https://www.kaggle.com/aleixdorca/wine-quality?select=winequality.csv

wine <- read_csv("./winequality.csv")

Let’s verify the dimensions and view the types.

## tibble [6,497 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ fixed acidity       : num [1:6497] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile acidity    : num [1:6497] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric acid         : num [1:6497] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual sugar      : num [1:6497] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num [1:6497] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free sulfur dioxide : num [1:6497] 11 25 15 17 11 13 15 15 9 17 ...
##  $ total sulfur dioxide: num [1:6497] 34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num [1:6497] 0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num [1:6497] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num [1:6497] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num [1:6497] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : num [1:6497] 5 5 5 6 5 5 5 7 7 5 ...
##  $ good                : num [1:6497] 0 0 0 0 0 0 0 1 1 0 ...
##  $ color               : chr [1:6497] "red" "red" "red" "red" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `fixed acidity` = col_double(),
##   ..   `volatile acidity` = col_double(),
##   ..   `citric acid` = col_double(),
##   ..   `residual sugar` = col_double(),
##   ..   chlorides = col_double(),
##   ..   `free sulfur dioxide` = col_double(),
##   ..   `total sulfur dioxide` = col_double(),
##   ..   density = col_double(),
##   ..   pH = col_double(),
##   ..   sulphates = col_double(),
##   ..   alcohol = col_double(),
##   ..   quality = col_double(),
##   ..   good = col_double(),
##   ..   color = col_character()
##   .. )

Simple Data Clean

Let’s confirm that there are no NAs.

apply(wine, 2, function(x) any(is.na(x)))
##        fixed acidity     volatile acidity          citric acid 
##                FALSE                FALSE                FALSE 
##       residual sugar            chlorides  free sulfur dioxide 
##                FALSE                FALSE                FALSE 
## total sulfur dioxide              density                   pH 
##                FALSE                FALSE                FALSE 
##            sulphates              alcohol              quality 
##                FALSE                FALSE                FALSE 
##                 good                color 
##                FALSE                FALSE

We should coerce color to a factor. We should also discretise quality, as its values are ordinal labels ranked from \(0\) to \(10\).

wine$color <- factor(wine$color)                    # factor color
wine$quality.factor <- factor(wine$quality)         # factor quality

str(wine)
## tibble [6,497 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ fixed acidity       : num [1:6497] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile acidity    : num [1:6497] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric acid         : num [1:6497] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual sugar      : num [1:6497] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num [1:6497] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free sulfur dioxide : num [1:6497] 11 25 15 17 11 13 15 15 9 17 ...
##  $ total sulfur dioxide: num [1:6497] 34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num [1:6497] 0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num [1:6497] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num [1:6497] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num [1:6497] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : num [1:6497] 5 5 5 6 5 5 5 7 7 5 ...
##  $ good                : num [1:6497] 0 0 0 0 0 0 0 1 1 0 ...
##  $ color               : Factor w/ 2 levels "red","white": 1 1 1 1 1 1 1 1 1 1 ...
##  $ quality.factor      : Factor w/ 7 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `fixed acidity` = col_double(),
##   ..   `volatile acidity` = col_double(),
##   ..   `citric acid` = col_double(),
##   ..   `residual sugar` = col_double(),
##   ..   chlorides = col_double(),
##   ..   `free sulfur dioxide` = col_double(),
##   ..   `total sulfur dioxide` = col_double(),
##   ..   density = col_double(),
##   ..   pH = col_double(),
##   ..   sulphates = col_double(),
##   ..   alcohol = col_double(),
##   ..   quality = col_double(),
##   ..   good = col_double(),
##   ..   color = col_character()
##   .. )

Let us rename features for better interpretability.

wine <-
  wine %>% rename(
      fixed.acidity = `fixed acidity`,
      volatile.acidity = `volatile acidity`,
      citric.acid = `citric acid`,
      residual.sugar = `residual sugar`,
      free.sulfur.dioxide = `free sulfur dioxide`,
      total.sulfur.dioxide = `total sulfur dioxide`
    )

colnames(wine)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "good"                 "color"                "quality.factor"

Exploring Quality

We should explore the distribution of wine qualities.

## 
##    3    4    5    6    7    8    9 
##   30  216 2138 2836 1079  193    5
wine %>%
  ggplot(aes(quality, fill=color)) +
    geom_histogram(binwidth = .5, col="black") +
        facet_grid(color ~ .) +
        labs(title="Wine Quality Histogram", subtitle="Wine Quality for Red and White Wine") +
        global_theme

The above confirms that both red and white wine have a very similar quality distribution. Let us consider the quality distribution collectively.

wine %>%
  ggplot( aes(x=quality) ) +
  geom_histogram() +
  labs(title="Wine Quality Histogram", subtitle="Wine Quality for ALL Wine") +
  global_theme

It is evident that the distribution of quality is not uniform, which could be problematic for classification tasks. The data appears to be approximately normally distributed. Let’s take a closer look.

wine %>% 
  ggplot( aes(x=quality) ) + 
  geom_histogram( aes(y=..count../sum(..count..)) ) +
  ylab("proportion") +
  stat_function(fun = dnorm, n = 101, args = list(mean = mean(wine$quality), sd = sd(wine$quality)), size=1.0, color="#ec6d5f") +
  labs(title="Wine Quality Histogram", subtitle="Normal Distribution Fit") +
  global_theme 

The models we train will have much more data for mid level qualities than low and high quality wines which could potentially lead to imbalanced predictions. Let’s get more quantitative.

# GET CONFIDENCE INTERVAL OF NORMAL DISTRIBUTION
confidence_interval <- function(mu, sigma, z) {
    lower <- mu - z * sigma
    upper <- mu + z * sigma
    
    return( c(lower, upper) )
}

mu <- mean(wine$quality)
sigma <- sd(wine$quality)
z <- 1.645 # 90% confidence interval

confidence_interval(mu, sigma, z)
## [1] 4.381873 7.254883

Thus, around \(90\%\) of our data has a quality of either \(5\), \(6\), or \(7\). This means that \(10\%\) of data has a quality of \(3\), \(4\), \(8\), or \(9\).
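We can sanity-check this claim against the empirical counts from the frequency table shown earlier (a quick base-R check; the counts below are copied from that table):

```r
# Quality counts copied from the frequency table above (qualities 3 to 9)
counts <- c(`3` = 30, `4` = 216, `5` = 2138, `6` = 2836, `7` = 1079, `8` = 193, `9` = 5)

# Empirical proportion of wines rated 5, 6 or 7
mid <- sum(counts[c("5", "6", "7")]) / sum(counts)
round(mid, 3)  # 0.932 -- broadly consistent with the ~90% normal-theory interval
```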

We can conclude that the testing and validation sets must be stratified samples to be representative of the original imbalanced data set for classifying quality. We can also use Synthetic Minority Oversampling Technique (SMOTE) on the training set to handle the imbalanced data problem. In short, we add synthetic instances to the minority classes to create a more uniform distribution. These synthetic instances are intelligently chosen by drawing a line between the examples in the feature space and creating a new sample at a point along that line.

https://arxiv.org/pdf/1106.1813.pdf (SMOTE paper, 2002)
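The interpolation step described above can be sketched in a few lines of base R. This is a toy illustration of the idea only, not the actual SMOTE implementation used later; the minority matrix and the neighbour choice are made up for the example:

```r
set.seed(1)

# Toy minority-class feature matrix (rows = samples, cols = two numeric features)
minority <- matrix(c(1, 2,
                     2, 3,
                     3, 5), ncol = 2, byrow = TRUE)

# Synthesise one new instance: take a base sample and one of its nearest
# minority neighbours, then pick a random point on the line between them
i <- 1                          # base sample
j <- 2                          # neighbour (chosen by k-NN in real SMOTE)
gap <- runif(1)                 # random position along the segment, in [0, 1]
synthetic <- minority[i, ] + gap * (minority[j, ] - minority[i, ])

synthetic  # always lies on the segment between the two parent samples
```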

We will have to address these points when preparing the data for classification models of quality.

Exploring Feature Variables

Let us firstly identify any correlations between feature variables. We will compute Pearson’s correlation coefficient.

num.vars <- setdiff( names(wine), c("color", "quality.factor") ) # correlation matrix only accepts class num
corr <- round( cor( wine[ , num.vars ], method="pearson" ), 2 )  # Lets find correlation using cor()

corr
##                      fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity                 1.00             0.22        0.32          -0.11
## volatile.acidity              0.22             1.00       -0.38          -0.20
## citric.acid                   0.32            -0.38        1.00           0.14
## residual.sugar               -0.11            -0.20        0.14           1.00
## chlorides                     0.30             0.38        0.04          -0.13
## free.sulfur.dioxide          -0.28            -0.35        0.13           0.40
## total.sulfur.dioxide         -0.33            -0.41        0.20           0.50
## density                       0.46             0.27        0.10           0.55
## pH                           -0.25             0.26       -0.33          -0.27
## sulphates                     0.30             0.23        0.06          -0.19
## alcohol                      -0.10            -0.04       -0.01          -0.36
## quality                      -0.08            -0.27        0.09          -0.04
## good                         -0.05            -0.15        0.05          -0.06
##                      chlorides free.sulfur.dioxide total.sulfur.dioxide density
## fixed.acidity             0.30               -0.28                -0.33    0.46
## volatile.acidity          0.38               -0.35                -0.41    0.27
## citric.acid               0.04                0.13                 0.20    0.10
## residual.sugar           -0.13                0.40                 0.50    0.55
## chlorides                 1.00               -0.20                -0.28    0.36
## free.sulfur.dioxide      -0.20                1.00                 0.72    0.03
## total.sulfur.dioxide     -0.28                0.72                 1.00    0.03
## density                   0.36                0.03                 0.03    1.00
## pH                        0.04               -0.15                -0.24    0.01
## sulphates                 0.40               -0.19                -0.28    0.26
## alcohol                  -0.26               -0.18                -0.27   -0.69
## quality                  -0.20                0.06                -0.04   -0.31
## good                     -0.16                0.01                -0.05   -0.28
##                         pH sulphates alcohol quality  good
## fixed.acidity        -0.25      0.30   -0.10   -0.08 -0.05
## volatile.acidity      0.26      0.23   -0.04   -0.27 -0.15
## citric.acid          -0.33      0.06   -0.01    0.09  0.05
## residual.sugar       -0.27     -0.19   -0.36   -0.04 -0.06
## chlorides             0.04      0.40   -0.26   -0.20 -0.16
## free.sulfur.dioxide  -0.15     -0.19   -0.18    0.06  0.01
## total.sulfur.dioxide -0.24     -0.28   -0.27   -0.04 -0.05
## density               0.01      0.26   -0.69   -0.31 -0.28
## pH                    1.00      0.19    0.12    0.02  0.03
## sulphates             0.19      1.00    0.00    0.04  0.03
## alcohol               0.12      0.00    1.00    0.44  0.39
## quality               0.02      0.04    0.44    1.00  0.76
## good                  0.03      0.03    0.39    0.76  1.00

It is easier to interpret these results visually.

# Let us plot a Correlogram using above returned results.
corr %>%
        ggcorrplot(
          hc.order = TRUE, 
          type = "lower", 
          lab = TRUE, 
          lab_size = 3, 
          colors = c("#ec6d5f", "white", "#27c9b8"), 
          title="Correlogram of Wine Data", 
          ggtheme=global_theme
        )

A summary of the findings is shown below:

  • Unsurprisingly, the better quality wine has a greater tendency to be graded as good by experts.
  • The quality has a slight tendency to go up when the alcohol content goes up.
  • The quality has a slight tendency to go down when the density, chlorides or volatile acidity of the wine increases.
  • Finally, coefficients close to zero mean that there is no linear correlation with quality.
    • sulphates, free sulfur dioxide, total sulfur dioxide, pH, citric acid, residual sugar

Let us look more closely at these highly correlated attributes below. We will also separate by color to start identifying trends between red and white wine.

plot1 <- wine %>% 
            ggplot(aes(x=quality.factor, y=alcohol) ) + 
            geom_boxplot( aes(fill=color) ) +
            labs(title="Alcohol vs Quality") +
            xlab("quality") +
            global_theme
  
plot2 <- wine %>% 
          ggplot(aes(x=quality.factor, y=density) ) + 
          geom_boxplot( aes(fill=color) ) +
          labs(title="Density vs Quality") +
          xlab("quality") +
          global_theme

plot3 <- wine %>% 
          ggplot(aes(x=quality.factor, y=chlorides) ) + 
          geom_boxplot( aes(fill=color) ) +
          labs(title="Chlorides vs Quality") +
          xlab("quality") +
          global_theme
  
plot4 <- wine %>% 
          ggplot(aes(x=quality.factor, y=volatile.acidity) ) + 
          geom_boxplot( aes(fill=color) ) +
          labs(title="Volatile Acidity vs Quality") +
          xlab("quality") +
          global_theme

grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

A summary of the findings is shown below:

Overall Insights

  • It is evident that the best quality wines have the highest median alcohol percentage.
  • The best quality wines have the lowest median density.
    • assuming that volume is constant, this means that high quality wines have less mass per litre
    • this may mean fewer impurities due to better filtering methods (this is just speculation)
  • The median chloride levels for the best quality wines are slightly lower than other wine qualities.
    • chlorides refers to the amount of salt in the wine
    • this may not be a relevant observation as there are a small number of instances for quality 9
  • Lower grade wines have a slightly higher median volatile acidity.
    • If volatile acidity levels are too high, they can give the wine a displeasing, vinegary taste.
    • This observation makes sense.

A classification task for wine quality based on wine chemistry seems difficult as the distributions above seem to have very slight variations. This becomes even more difficult when we have an imbalanced data set. We should keep in mind that a model off by \(\pm 1\) quality rating may not be too bad. We may care about wine being high vs medium quality more than \(6\) vs \(7\) quality.

We should come back to these points when creating our models.

Comparing Red and White Wine

Firstly, we can see that there are no quality \(9\) wines that are red. In fact, there are only \(5\) instances of this quality. We should remove these instances, as no classifier can learn to recognise very high quality wine from such a small number of observations, even with SMOTE techniques.

df <- wine                        # df will hold the untouched data set if needed
wine <- wine[wine$quality != 9,]  # filter out the 5 quality 9 obs.

The distributions for density, chlorides and volatile acidity seem very separated between red and white wines. A classification task for determining wine color based on its chemistry does not seem that complex as the data seems linearly separable for many feature variables. Let’s take a closer look.

dense1 <- wine %>% 
            ggplot( aes(x=alcohol) ) + 
            geom_density( aes(fill=color), alpha=0.5 ) +
            ylab("") +
            labs(title="Alcohol") +
            global_theme

dense2 <- wine %>% 
            ggplot( aes(x=density) ) + 
            geom_density( aes(fill=color), alpha=0.5 ) +
            ylab("") +
            labs(title="Wine Density") +
            global_theme

dense3 <- wine %>% 
            ggplot( aes(x=chlorides) ) + 
            geom_density( aes(fill=color), alpha=0.5 ) +
            ylab("") +
            labs(title="Chlorides") +
            global_theme

dense4 <- wine %>% 
            ggplot( aes(x=volatile.acidity) ) + 
            geom_density( aes(fill=color), alpha=0.5 ) +
            ylab("") +
            labs(title="Volatile Acidity") +
            global_theme

grid.arrange(dense1, dense2, dense3, dense4, ncol=2)

The above confirms good separation in density, chlorides and volatile acidity. Alcohol content does not divide wine color well, as expected from the box plots.

Total sulfur dioxide also appears to separate the two colors strongly. Let us plot total sulfur dioxide against chlorides to view any trends.

wine %>%
  ggplot(aes(x=chlorides, y=total.sulfur.dioxide, color=color)) +
  geom_point(size=1, alpha=0.5) +
  labs(title="Total Sulfur Dioxide vs Chlorides") +
  global_theme

It is evident that wine color is linearly separable between these two variables. A simple multi-variable linear model seems like it would perform very well.

That is enough exploring! Let us prepare our data for training.


1. Classification of Wine Color

1.1 Preparation

The first step is to divide our data into training and test sets for modeling. A calibration set will be carved out of the training data shortly.

  • Train - \(90\%\)
  • Test - \(10\%\)
# PREPARE DATASET
d <- wine

# CREATE TRAIN & TEST SET
set.seed(4009)
d$rgroup <- runif( nrow(d) )
dTrainAll <- d %>% subset(rgroup <= 0.9) %>% as.data.frame  # 90% train
dTest <- d %>% subset(rgroup > 0.9) %>% as.data.frame       # 10% test

# ISOLATE FEATURE VARIABLES
outcomes <- c("color")
vars <- setdiff(colnames(d), c(outcomes, "rgroup")  )

# SEPARATE CATEGORICAL AND NUMERICAL FEATURE VARIABLES
catVars <- vars[ sapply(dTrainAll[, vars], class) %in% c("factor", "character") ]  # should be empty
numericVars <- vars[ sapply(dTrainAll[, vars], class) %in% c("numeric", "integer") ]

# CLEAN MEMORY
rm(list = c("d"))

Let us now set our response variable to color, with red as the positive outcome. We shall also create our calibration set to allow cross-validation during our single-variable analysis.

  • Train - \(90\%\)
  • Calibration - \(10\%\)
# SET RESPONSE AND POSITIVE OUTCOME VARS
outcome <- "color"
pos <- "red"

# CREATE CALIBRATION SET (10% - TRAIN)
useForCal <- rbinom( n=nrow(dTrainAll), size = 1, prob = 0.1 ) > 0
dCal <- subset(dTrainAll, useForCal)
dTrain <- subset(dTrainAll, !useForCal)

Let us find the majority class classifier as our null model in order to identify a good baseline.

col.tab <- prop.table(table(wine$color))
red <- round(col.tab[1], 4)
white <- round(col.tab[2], 4)

ggplot(wine, aes(x = color, fill = color)) +
    geom_bar(aes(y = (..count..) / sum(..count..)), width = 0.75, show.legend = FALSE) + 
    scale_y_continuous(labels = percent_format()) +
    annotate("text", x = 'red', y = red + 0.02, label = paste0(red * 100, "%"), size=rel(6.0)) +
    annotate("text", x = 'white', y = white + 0.02, label = paste0(white * 100, "%"), size=rel(6.0)) +
    labs(title="Wine Color Frequencies") +
    ylab("Relative Frequencies") +
    global_theme

It is evident that \(75.37\%\) of the wine in our data set is white. Thus, our null model (always predicting white) will achieve an accuracy of \(75.37\%\).
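As a quick sketch, the null model simply predicts white for every wine, so its accuracy is just the majority-class proportion. The class counts below are assumptions inferred from the 6,497-row data set (1,599 red wines, and 4,893 white after dropping the five quality-9 instances):

```r
# Assumed class counts after removing the five quality-9 (white) wines
n_red <- 1599
n_white <- 4893

# Majority-class ("white") null model accuracy
null_accuracy <- n_white / (n_red + n_white)
round(null_accuracy * 100, 2)  # 75.37
```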

The unequal distribution of the color variable can cause a classifier’s performance to be biased towards the majority class (white). Much research shows that classifiers perform better on balanced data sets than on imbalanced ones; Yanmin Sun, Andrew Wong and Mohamed Kamel’s Classification of imbalanced data: a review (2009) is one interesting paper.

We will use the SMOTE (Synthetic Minority Oversampling Technique) oversampling method to smartly produce synthetic instances to balance the data set. We will only use SMOTE for the training data as we want the test and calibration sets to reflect real observations when measuring model accuracies.

# SMOTE - Synthetic Minority Oversampling Technique
balancedTrain <- SMOTE(color ~., dTrain, perc.over = 100, k = 5, perc.under = 200)

Let’s verify that the synthetic sampling worked.

col.tab <- prop.table(table(balancedTrain$color))
red <- round(col.tab[1], 4)
white <- round(col.tab[2], 4)

ggplot(balancedTrain, aes(x = color, fill = color)) +
    geom_bar(aes(y = (..count..) / sum(..count..)), width = 0.75, show.legend = FALSE) + 
    scale_y_continuous(labels = percent_format()) +
    annotate("text", x = 'red', y = red + 0.02, label = paste0(red * 100, "%"), size=rel(6.0)) +
    annotate("text", x = 'white', y = white + 0.02, label = paste0(white * 100, "%"), size=rel(6.0)) +
    labs(title="SMOTE Wine Color Frequencies") +
    ylab("Relative Frequencies") +
    global_theme

Now that we have applied SMOTE to our training data set, we can start creating some models.

1.2 Single Variable Modelling

We should begin with finding the best single-variable model to identify which variables have a significant relationship with the response variable (color). We can then use this information to help build our multi-variable models.

Let’s build two functions as we will need to deal with numerical and categorical variables separately. These functions will help make predictions and store the results in a new column of the dataframe named with the prefix ‘pred’.

# MAKE PREDICTIONS ON CATEGORICAL VARIABLES
mkPredC <- function(outCol, varCol, appCol) {
  
  pPos <- sum(outCol==pos)/length(outCol)
  
  naTab <- table(as.factor(outCol[is.na(varCol)]))
  pPosWna <- (naTab/sum(naTab))[pos]
  
  vTab <- table(as.factor(outCol),varCol)
  pPosWv <- (vTab[pos,]+1.0e-3*pPos)/(colSums(vTab)+1.0e-3)
  
  pred <- pPosWv[appCol]
  pred[is.na(appCol)] <- pPosWna
  pred[is.na(pred)] <- pPos

  pred

}

# DISCRETISATION - BIN NUMERIC FEATURES INTO CATEGORICAL RANGES AND THEN CALL mkPredC TO MAKE PREDICTIONS
mkPredN <- function(outCol, varCol, appCol) {
  
  # DISCRETISE CRV
  cuts <- unique( as.numeric( quantile(varCol, probs=seq(0, 1, 0.1), na.rm=T) ) ) # equal frequency binning
  varC <- cut(varCol, cuts)
  appC <- cut(appCol, cuts)
  
  # MAKE PREDICTION AS FOR CATEGORICAL
  mkPredC(outCol, varC, appC)
  
}

We will also create an AUC function to calculate the area under the ROC curve to help compare performances.

calcAUC <- function(predcol, outcol) {
    perf <- performance(prediction(predcol, outcol == pos), 'auc')
    as.numeric(perf@y.values)
}
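ROCR does the computation here, but it helps to recall what the AUC measures: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. A dependency-free base-R equivalent, shown purely for intuition (aucRank is a made-up helper name, not part of ROCR):

```r
# AUC as the Mann-Whitney statistic: P(score_pos > score_neg), ties count 1/2
aucRank <- function(predcol, outcol) {
  diffs <- outer(predcol[outcol], predcol[!outcol], FUN = "-")
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

# Toy scores: positives mostly rank above negatives
scores <- c(0.9, 0.8, 0.4, 0.7, 0.2, 0.1)
labels <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
aucRank(scores, labels)  # 8 of the 9 positive/negative pairs ranked correctly: 0.889
```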

We now loop through each categorical and numerical variable, applying the appropriate prediction function along with the AUC function. If the training AUC is above the specified threshold, we display both the training and calibration AUCs for comparison. When the training and calibration AUCs are similar, it indicates that the training data is representative of the full data set and the single-variable model is likely to generalise well to the test data. When the training AUC is much greater than the calibration AUC, the model is overfitting: it is too sensitive to small variations in the training data, causing it to generalise poorly on new data.

thresh = 0.8

# CATEGORICAL PREDICTIONS
for(var in catVars) {
    pred <- paste0("pred", var)
    
    balancedTrain[, pred] <- mkPredC(balancedTrain[, outcome], balancedTrain[, var], balancedTrain[, var])
    dCal[, pred] <- mkPredC(balancedTrain[, outcome], balancedTrain[, var], dCal[, var])
    dTest[, pred] <- mkPredC(balancedTrain[, outcome], balancedTrain[, var], dTest[, var])
    
    aucTrain <- calcAUC(balancedTrain[, pred], balancedTrain[, outcome])
    
    if(aucTrain >= thresh) {
        aucCal <- calcAUC(dCal[, pred], dCal[, outcome])
        print(sprintf("%s, trainAUC: %4.3f calibrationAUC: %4.3f", pred, aucTrain, aucCal))
    }
}

# NUMERICAL PREDICTIONS
for(var in numericVars) {
    pred <- paste0("pred", var)
    
    balancedTrain[, pred] <- mkPredN(balancedTrain[, outcome], balancedTrain[, var], balancedTrain[, var])
    dTest[, pred] <- mkPredN(balancedTrain[, outcome], balancedTrain[, var], dTest[, var])
    dCal[, pred] <- mkPredN(balancedTrain[, outcome], balancedTrain[, var], dCal[, var])
    
    aucTrain <- calcAUC(balancedTrain[, pred], balancedTrain[, outcome])
    
    if(aucTrain >= thresh) {
        aucCal <- calcAUC(dCal[, pred], dCal[, outcome])
        print(sprintf("%s, trainAUC: %4.3f calibrationAUC: %4.3f", pred, aucTrain, aucCal))
    }
}
## [1] "predvolatile.acidity, trainAUC: 0.910 calibrationAUC: 0.885"
## [1] "predresidual.sugar, trainAUC: 0.886 calibrationAUC: 0.883"
## [1] "predchlorides, trainAUC: 0.961 calibrationAUC: 0.969"
## [1] "predfree.sulfur.dioxide, trainAUC: 0.850 calibrationAUC: 0.845"
## [1] "predtotal.sulfur.dioxide, trainAUC: 0.955 calibrationAUC: 0.962"
## [1] "preddensity, trainAUC: 0.808 calibrationAUC: 0.774"
## [1] "predsulphates, trainAUC: 0.836 calibrationAUC: 0.814"

The output above shows the feature variables with an AUC above the specified threshold. The similarity between training and calibration AUCs indicates that the training data is representative of the full data set. We should now use cross-validation to confirm that our training data is appropriate. The function below repeatedly draws new training and calibration sets and re-runs the prediction function; this is repeated for a number of repetitions (e.g. 100), and the mean and standard deviation of the AUCs are reported.

It is important to check that we did not simply get a lucky calibration set. It is also important to realise that our AUC calculations above used synthetic samples from SMOTE, whereas the cross-validation below does not.

# K-FOLD CROSS VALIDATION
fCross <- function(var) {
    useForCalRep <- rbinom(n = nrow(dTrainAll), size = 1, prob = 0.1 ) > 0 
    predRep <- mkPredN(dTrainAll[!useForCalRep, outcome],
                       dTrainAll[!useForCalRep, var],
                       dTrainAll[useForCalRep, var])
    calcAUC(predRep, dTrainAll[useForCalRep, outcome])
}

k <- 100
aucs <- replicate(k, fCross("total.sulfur.dioxide"))

print(sprintf("cross-validation mean: %4.3f, cross-validation standard deviation: %4.3f", mean(aucs), sd(aucs)))
## [1] "cross-validation mean: 0.947, cross-validation standard deviation: 0.010"

The total.sulfur.dioxide variable gives a cross-validation mean of \(0.947\), which is very close to the calibration AUC of \(0.962\) computed above. A small standard deviation verifies that the cross-validation AUC does not vary much. Running this function with the other variables that met our threshold gives similar results, which confirms that our training set is ready to be used in further modeling.


1.3 Logistic Regression

We have a very high performing single-variable model. Now it is time to see if we can improve by taking advantage of multiple feature variables to make predictions. One model we could train is a K Nearest Neighbour classifier. KNN is expensive both in time and space. Sometimes we can get similar results with more efficient methods such as logistic regression. Logistic regression seems like a good model as we have a relatively small number of feature variables and categorical variables have a small number of levels.

Before training any models, it may be wise to do some feature selection, considering that the winequality.names text file states that “not all input variables are necessarily relevant”. We will perform feature selection by considering the log-likelihoods as shown below.

We will first get the base rate of the NULL model.

# Define a convenience function to compute log likelihood.
logLikelihood <- function(outCol, predCol) {
  sum(ifelse(outCol==pos,log(predCol),log(1-predCol)))
}

# Compute the base rate of a NULL model
baseRateCheck <- 
  logLikelihood(dCal[,outcome], sum(dCal[,outcome]==pos)/length(dCal[,outcome]))

baseRateCheck
## [1] -310.4419

We shall now loop through all the variables and select only the most influential features.

selVars <- c()
minStep <- 5

# CATEGORICAL FEATURE SELECTION
for(v in catVars) {
  pi <- paste('pred', v, sep='')
  liCheck <- 2*((logLikelihood(dCal[,outcome], dCal[,pi]) - 1 - baseRateCheck))
  
  if(liCheck > minStep) {
    print(sprintf("%s, calibrationScore: %g",pi ,liCheck))
    selVars <- c(selVars, v)
  }
}

# NUMERIC FEATURE SELECTION
for(v in numericVars) {
  pi <- paste('pred', v, sep='')
  liCheck <- 2*((logLikelihood(dCal[,outcome], dCal[,pi]) - 1 - baseRateCheck))

  if(liCheck >= minStep) {
    print(sprintf("%s, calibrationScore: %g", pi, liCheck))
    selVars <- c(selVars, v)
  }
}
## [1] "predfixed.acidity, calibrationScore: 32.0468"
## [1] "predresidual.sugar, calibrationScore: 159.033"
## [1] "predchlorides, calibrationScore: 379.684"
## [1] "predfree.sulfur.dioxide, calibrationScore: 72.9201"
## [1] "predtotal.sulfur.dioxide, calibrationScore: 352.099"
## [1] "preddensity, calibrationScore: 7.81743"
## [1] "predsulphates, calibrationScore: 45.4651"

The important features are:

selVars
## [1] "fixed.acidity"        "residual.sugar"       "chlorides"           
## [4] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [7] "sulphates"

We should also prepare our data to avoid feeding the model duplicate information, which causes multicollinearity. We manage this by removing variables that are too highly correlated with each other.

By referencing the correlogram in the exploration section:

  • total.sulfur.dioxide and free.sulfur.dioxide
  • residual.sugar and density
  • chlorides and density
  • sulphates and density

We will keep the variable total.sulfur.dioxide as well as density and remove the others. Note that we could have kept total.sulfur.dioxide, residual.sugar, chlorides and sulphates, however, we should consider the option that results in the lowest number of dimensions.
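The same pairs can also be pulled out programmatically rather than by eye. A small base-R sketch over a toy correlation matrix; the 0.7 cutoff is an arbitrary assumption for the example, not a hard rule:

```r
# Toy correlation matrix for three features
m <- matrix(c(1.00, 0.72, 0.10,
              0.72, 1.00, 0.55,
              0.10, 0.55, 1.00),
            nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Report every pair whose |correlation| exceeds the cutoff
# (the upper triangle avoids reporting each pair twice)
cutoff <- 0.7
hits <- which(abs(m) > cutoff & upper.tri(m), arr.ind = TRUE)
pairs <- unname(apply(hits, 1, function(idx)
  paste(rownames(m)[idx[1]], colnames(m)[idx[2]], sep = " ~ ")))
pairs  # "a ~ b"
```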

drop <- c('fixed.acidity', 'residual.sugar', 'free.sulfur.dioxide', 'chlorides', 'sulphates')

selVars <- selVars[!selVars %in% drop]

selVars
## [1] "total.sulfur.dioxide" "density"
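The manual pruning above can also be automated from the correlation matrix. A minimal sketch (the helper name and the 0.7 cutoff are assumptions, not part of the original analysis):

```r
# Greedily keep each variable only if it is not highly correlated
# (|r| above the cutoff) with any variable already kept.
prune_correlated <- function(df, vars, cutoff = 0.7) {
  cm <- abs(cor(df[, vars]))
  keep <- c()
  for (v in vars) {
    if (all(cm[v, keep] <= cutoff)) keep <- c(keep, v)
  }
  keep
}
```

Note that a greedy pass keeps the earliest variable of each correlated group, so the result depends on the ordering of `vars`.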

Now we can create our binary logistic regressor.

f <- paste(outcome, " == 'red' ~ ", paste(selVars, collapse = ' + '), sep = '')

gmodel <- glm(as.formula(f),data=dTrain, family=binomial(link='logit'))

log.pred <- c(
    calcAUC(predict(gmodel, newdata = dTrain), dTrain[, outcome]),
    calcAUC(predict(gmodel, newdata = dCal), dCal[, outcome]),
    calcAUC(predict(gmodel, newdata = dTest), dTest[, outcome]))

log.pred
## [1] 0.9822171 0.9940508 0.9810298

We can see that our model performs similarly on the three datasets meaning that it generalises well. The AUCs are also very high.
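The `calcAUC` helper used throughout has not been shown. A minimal base-R sketch consistent with its use (the rank-sum formulation of AUC; the original likely used a package such as ROCR, and `pos` is the positive-class label assumed defined earlier):

```r
# Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
# the probability that a random positive scores above a random negative.
calcAUC <- function(predcol, outcol) {
  y <- outcol == pos
  r <- rank(predcol)  # average ranks handle ties correctly
  (sum(r[y]) - sum(y) * (sum(y) + 1) / 2) / (sum(y) * sum(!y))
}
```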

Let’s plot the logistic curve to try and visualise the performance.

predict.data <- data.frame(probability.of.color = gmodel$fitted.values, color = dTrain$color)

predict.data <- predict.data[ order(predict.data$probability.of.color, decreasing = FALSE), ]

predict.data$rank <- 1:nrow(predict.data)

predict.data %>% 
  ggplot( aes(x=rank, y=probability.of.color)) +
  geom_point( aes(color=color), alpha=1, shape=4, stroke=2 ) +
  xlab("index") +
  ylab("Pred Probability of Red") +
  labs(title="Logistic Regression Curve") +
  global_theme

We can see that most of the red wines are predicted to have a relatively high probability and most white wines are predicted to have a relatively low probability. Thus, it seems that we have a good model but let us cross-validate to make sure.

# REPEATED RANDOM HOLD-OUT CROSS-VALIDATION
fCrossLogit <- function() {
    # score the model on a random ~10% subset of the full training data
    useForCalRep <- rbinom(n = nrow(dTrainAll), size = 1, prob = 0.1) > 0 
    predRep <- predict(gmodel, newdata = dTrainAll[useForCalRep, ])
    
    calcAUC(predRep, dTrainAll[useForCalRep, outcome])
    
}

k <- 100
aucs <- replicate(k, fCrossLogit())

print(sprintf("cross-validation mean: %4.3f, cross-validation standard deviation: %4.3f", mean(aucs), sd(aucs)))
## [1] "cross-validation mean: 0.984, cross-validation standard deviation: 0.006"

Again, we have a cross-validation mean that is very close to the AUC calculations above. A small standard deviation verifies that the cross-validation AUC does not vary much. Viewing the logistic regression curve and cross-validating should be enough to justify that we have a good model here. We will use more sophisticated analysis techniques when building a multi-class classifier for wine quality.
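As a quick sanity check (an addition, not part of the original analysis), an approximate 95% interval for the cross-validated AUC can be formed from the replicate mean and standard deviation, assuming the replicated AUCs are roughly normally distributed:

```r
# Approximate normal interval for replicated AUC values
# (hypothetical helper).
auc_interval <- function(aucs, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = mean(aucs) - z * sd(aucs),
    upper = mean(aucs) + z * sd(aucs))
}
```

Applied to the `aucs` vector above, this gives roughly \(0.984 \pm 1.96 \times 0.006\).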


1.4 Decision Trees

We will now explore a decision tree model for classification. Specifically, we will build three models.

  1. Using the raw feature variables
  2. Using the pred prefixed variables
  3. Using the raw features with different hyperparameters
# MODEL ONE

fmla1 <- paste0(outcome,"== 'red' ~ ", paste(c(catVars, numericVars), collapse = " + "))
tmodel1 <- rpart(fmla1, data = balancedTrain)

decision.tree.1 <- c(
    calcAUC(predict(tmodel1, newdata = balancedTrain), balancedTrain[, outcome]),
    calcAUC(predict(tmodel1, newdata = dCal), dCal[, outcome]),
    calcAUC(predict(tmodel1, newdata = dTest), dTest[, outcome])
)

decision.tree.1
## [1] 0.9783275 0.9895650 0.9773116
# MODEL TWO

prefixed.vars <- paste0('pred', c(catVars, numericVars))
fmla2 <- paste0(outcome,"== 'red' ~ ", paste(prefixed.vars, collapse = " + "))
tmodel2 <- rpart(fmla2, data = balancedTrain)

decision.tree.2 <- c(
    calcAUC(predict(tmodel2, newdata = balancedTrain), balancedTrain[, outcome]),
    calcAUC(predict(tmodel2, newdata = dCal), dCal[, outcome]),
    calcAUC(predict(tmodel2, newdata = dTest), dTest[, outcome]))

decision.tree.2
## [1] 0.9824090 0.9915335 0.9853474
# MODEL THREE

tmodel3 <- 
    rpart(
          fmla1, data = balancedTrain, 
          control = rpart.control(cp = 0.001, minsplit = 1000, minbucket = 1000, maxdepth = 5)
        )

decision.tree.3 <- c(
    calcAUC(predict(tmodel3, newdata = balancedTrain), balancedTrain[, outcome]),
    calcAUC(predict(tmodel3, newdata = dCal), dCal[, outcome]),
    calcAUC(predict(tmodel3, newdata = dTest), dTest[, outcome]))

decision.tree.3
## [1] 0.9661766 0.9790952 0.9641451
# DISPLAY RESULTS

dt.matrix <- matrix(
                    c(decision.tree.1, decision.tree.2, decision.tree.3), 
                      nrow = 3, ncol = 3, byrow = FALSE, 
                      dimnames = list(c("Train", "Cal", "Test"), 
                                    c("decision tree 1", "decision tree 2", "decision tree 3"))
                    )
dt.matrix
##       decision tree 1 decision tree 2 decision tree 3
## Train       0.9783275       0.9824090       0.9661766
## Cal         0.9895650       0.9915335       0.9790952
## Test        0.9773116       0.9853474       0.9641451

It is evident that all our models perform well on the training set and generalise well to the unseen data. The decision tree trained on the pred-prefixed values (decision tree 2) performs best. Let’s try to explain these results.

First, let’s print the rules so that we can clearly see the quantitative structure.

rpart.rules(tmodel2, cover = TRUE, style = "tall")

Now, let’s visualise the hierarchical structure of our decision tree.

# DECISION TREE - MODEL 2 (COLOR == RED)
rpart.plot(tmodel2, box.palette="Reds")

For ease of interpretation, small fitted values are displayed as white to indicate a white wine classification, and large fitted values are displayed as red to indicate a red wine classification, with a continuous color scale for in-between values.

The important takeaway from this model is that only a small number of features drive the splits.

We should investigate whether we can achieve similar results with only the two most important features. We will use the non pred-prefixed values for our investigation.
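One way to rank feature importance (a sketch of how such a ranking can be obtained; the helper name is hypothetical) is rpart's built-in variable importance:

```r
# Hypothetical helper: ranked variable importance for a fitted rpart tree.
library(rpart)

tree_importance <- function(fit) {
  sort(fit$variable.importance, decreasing = TRUE)
}
```

`tree_importance(tmodel1)` would rank the raw features; the top two here would presumably be total.sulfur.dioxide and chlorides, consistent with the feature-selection model below.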

# FEATURE SELECTION MODEL

fmlaFS <- paste0(outcome,"== 'red' ~ ", paste(c("total.sulfur.dioxide", "chlorides"), collapse = " + "))
  
tmodel4 <- 
    rpart(
          fmlaFS, data = balancedTrain, 
          control = rpart.control(cp = 0.001, minsplit = 1000, minbucket = 1000, maxdepth = 5)
        )

decision.tree.4 <- c(
    calcAUC(predict(tmodel4, newdata = balancedTrain), balancedTrain[, outcome]),
    calcAUC(predict(tmodel4, newdata = dCal), dCal[, outcome]),
    calcAUC(predict(tmodel4, newdata = dTest), dTest[, outcome]))

decision.tree.4
## [1] 0.9661766 0.9790952 0.9641451

Remarkably, we still get excellent results from just two feature variables; total.sulfur.dioxide and chlorides appear to be highly predictive of wine color. However, we will still stick with our best model when comparing results in the next section.


1.5 Wine Color Classification Model Comparisons / Evaluation

Having built our logistic regression and our three decision trees, we can now compare them visually against each other and against the null model. Since our classifiers perform very well, it is also informative to compare them against a saturated model. Below are the calculations for a Bayes rate model.

# SELECT ONLY THE FEATURE VARS
wine.features <- subset(wine, select = -c(color) )

# FIND ROWS THAT ARE DUPLICATED
duplicates <- duplicated(wine.features) | duplicated(wine.features, fromLast = TRUE)

# UPDATE FEATURE DF
wine.features <- wine.features[ duplicates, ]
print(sprintf("Feature variables have %d duplicated rows", dim(wine.features)[1]))
## [1] "Feature variables have 2172 duplicated rows"
# ADD BACK THE LABELS
wine.features$color <- wine$color[duplicates]

# CHECK IF THERE ARE DUPLICATES
duplicates <- duplicated(wine.features) | duplicated(wine.features, fromLast = TRUE)
wine.features <- wine.features[ duplicates, ]
print(sprintf("Feature variables with labels have %d duplicated rows", dim(wine.features)[1]))
## [1] "Feature variables with labels have 2169 duplicated rows"

We can conclude that there are \(2172\) observations that share feature variable values with another observation. \(2169\) of these also have matching labels (color), so \(2172 - 2169 = 3\) observations have identical feature variables but different colors. A Bayes rate model is therefore a perfect model that makes only \(3\) mistakes, those arising where multiple examples share the same set of known facts but have different outcomes.

The saturated model is calculated below.

mistakes <- 3
obs <- dim(wine)[1]

saturated.model <- (obs - mistakes) / obs

saturated.model
## [1] 0.9995379
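The null-model baseline drawn in the comparison plot (0.7537) corresponds, we assume, to the majority-class proportion, which can be computed directly (the helper name is hypothetical):

```r
# Hypothetical helper: accuracy of always predicting the most common class.
majority_rate <- function(labels) {
  max(table(labels)) / length(labels)
}
```

`majority_rate(wine$color)` should reproduce the 0.7537 baseline.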

Now let us visualise the comparisons.

# CONSTRUCT EVAL DF
eval.df <- data.frame(
                      Model = rep(c("decision tree 1", "decision tree 2", "decision tree 3", "binary logistic model"), each = 3),
                      Data = rep(c("Train", "Cal", "Test"), times = 4),
                      Value = c(decision.tree.1, decision.tree.2, decision.tree.3, log.pred)
                    )

# PLOT DF RESULTS
ggplot(eval.df, 
      # PLOT RESULTS
      aes(x = factor(Model, levels = unique(Model)), y = Value, fill = factor(Data, levels = unique(Data)))) +
      geom_bar(stat = "identity", position = "dodge") +
      geom_text(aes(y = 1.02, label = round(Value, digits = 3)), position = position_dodge(width = 1), size = 4) +
    
      # PLOT NULL MODEL
      geom_hline(yintercept = 0.7537, color = "black") +
      annotate(geom = "text", x = 4.3, y = 0.77, label = "NULL Model", colour = "black") +
  
      # PLOT SATURATED MODEL
      geom_hline(yintercept = saturated.model, color = "black") +
      annotate(geom = "text", x = 0.6, y = 0.99, label = "Saturated Model", colour = "black") +
    
      # LABELS AND AESTHETICS
      ylim(0, 1.2) +
      ylab("Value") +
      xlab("Model") +
      labs(title="Model Evaluation") +
      global_theme +
      theme(legend.title = element_blank())

We can see that all our models greatly outperform the null model and sit very close to the saturated model. Decision tree 2 (trained on the pred-prefixed values) slightly outperforms the binary logistic regressor; however, all models perform very well.

Let’s select decision tree 2 as our preferred model and analyse it further.

# GET PERFORMANCE MEASURES
performanceMeasures <- function(pred, truth, name = "model") {

    # DEV
    dev.norm <- -2 * logLikelihood(truth, pred)/length(pred)
    
    # ACCURACY, PRECISION AND RECALL
    ctable <- table(truth = truth==pos, pred = (pred > 0.5))
    accuracy <- sum(diag(ctable)) / sum(ctable)
    precision <- ctable[2, 2] / sum(ctable[, 2])
    recall <- ctable[2, 2] / sum(ctable[2, ])
    
    # F1 SCORE
    f1 <- 2 * precision * recall / (precision + recall)
    
    # RETURN DF WITH PERFORMANCE MEASURES
    data.frame(model = name, precision = precision, recall = recall,f1 = f1, dev.norm = dev.norm)
    
}

# FORMATTER
panderOpt <- function() {
  
  # SET PANDER OPTIONS
  panderOptions("plain.ascii", TRUE)
  panderOptions("keep.trailing.zeros", TRUE)
  panderOptions("table.style", "simple")
  
}

# PRETTY PERFORMANCE TABLE FUNCTION
pretty_perf_table <- function(model, training, test) {
  
  # SET PANDER OPTIONS
  panderOpt()
  perf_justify <- "lrrrr"
  
  # COMPARING PERFORMANCE ON TRAIN VS TEST
  pred_train <- predict(model, newdata = training)
  truth_train <- training[, outcome]
  pred_test <- predict(model, newdata = test)
  truth_test <- test[, outcome]
  
  trainperf_tree <- performanceMeasures(pred_train, truth_train, "training")
  testperf_tree <- performanceMeasures(pred_test, truth_test, "test")
  
  perftable <- rbind(trainperf_tree, testperf_tree)
  pandoc.table(perftable, justify = perf_justify)
  
}

# PLOT ROC CURVE
plot_roc <- function(predcol1, outcol1, predcol2, outcol2) {

  roc_1 <- rocit(score=predcol1,class=outcol1==pos)
  roc_2 <- rocit(score=predcol2,class=outcol2==pos)

  plot(roc_1, col = c("blue", "green"), lwd = 3,
       legend = FALSE, YIndex = FALSE, values = TRUE)
  lines(roc_2$TPR ~ roc_2$FPR, lwd = 1, col = c("red", "green"))
  legend("bottomright", col = c("blue", "red", "green"),
         c("Test Data", "Training Data", "Null Model"), lwd = 2)
  
}


# PLOT ROC CURVE
plot_roc2 <- function(predcol1, outcol1, predcol2, outcol2, predcol3, outcol3, predcol4, outcol4) {

  roc_1 <- rocit(score=predcol1,class=outcol1==pos)
  roc_2 <- rocit(score=predcol2,class=outcol2==pos)
  roc_3 <- rocit(score=predcol3,class=outcol3==pos)
  roc_4 <- rocit(score=predcol4,class=outcol4==pos)

  plot(roc_1, col = c("blue","green"), lwd = 3, legend = FALSE,YIndex = FALSE, values = TRUE)
  
  lines(roc_2$TPR ~ roc_2$FPR, lwd = 1, col = c("red","green"))
  
  lines(roc_3$TPR ~ roc_3$FPR, lwd = 1, col = c("orange","green"))
  
  lines(roc_4$TPR ~ roc_4$FPR, lwd = 1, col = c("purple","green"))
  
  legend("bottomright", col = c("blue", "red", "orange", "purple", "green"),
         c("dtree 1", "dtree 2", "dtree 3", "binary logistic", "Null Model"), lwd = 2)
  
}
pretty_perf_table(tmodel2, balancedTrain, dTest)
## 
## 
## model        precision   recall      f1   dev.norm
## ---------- ----------- -------- ------- ----------
## training        0.9836   0.9685   0.976     0.2086
## test            0.9500   0.9500   0.950     0.2058
pred_test_roc <- predict(tmodel2,newdata=dTest)
pred_train_roc <- predict(tmodel2,newdata=balancedTrain)

plot_roc(pred_test_roc, dTest[[outcome]], pred_train_roc, balancedTrain[[outcome]])

The above further validates our high-performing model, with the ROC curves for both the test and training datasets showing high AUC values. The performance table shows a small deviance and a high F1 score, indicating that both precision and recall are high.

For completeness, we will plot all the models’ ROC curves for comparison.

tmodel1_roc <- predict(tmodel1,newdata=dTest)
tmodel2_roc <- predict(tmodel2,newdata=dTest)
tmodel3_roc <- predict(tmodel3,newdata=dTest)
logistic_roc <- predict(gmodel,newdata=dTest)

plot_roc2(tmodel1_roc, dTest[[outcome]], tmodel2_roc, dTest[[outcome]], tmodel3_roc, dTest[[outcome]], logistic_roc, dTest[[outcome]])


1.6 Elevated Task

Simple Linear Regressor

We have found that classifying color is a fairly trivial task; even a very simple model gives great results. Let’s now attempt the harder problem of predicting wine quality, a multi-class classification task. An interesting approach is to construct a regression model and round its output to the nearest whole number when making predictions.
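The rounding approach can be sketched as a small helper (the function name is hypothetical; clamping to the 0-10 quality scale is an assumption):

```r
# Convert a continuous quality prediction into a categorical score
# by rounding and clamping to the valid 0-10 range.
predict_quality_class <- function(model, newdata) {
  pmin(pmax(round(predict(model, newdata)), 0), 10)
}
```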

First, let’s create a new test and train set.

# 80-20 SPLIT FOR TRAIN AND TEST
sampleSize <- round(nrow(wine) * 0.8)
idx <- sample(seq_len(nrow(wine)), size = sampleSize) # sample training row indices

# CONSTRUCT SETS
X.train.reg <- wine[idx,]
X.test.reg <- wine[-idx,]

And now we can construct a simple linear regression model to fit the training data.

# CREATE OUR LINEAR REGRESSION MODEL
model.reg1 <- lm(quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, 
                 data = X.train.reg)

summary(model.reg1)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = X.train.reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5461 -0.4683 -0.0287  0.4822  2.5629 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -2.008e+01  6.218e+00  -3.229 0.001250 ** 
## fixed.acidity        -2.440e-02  1.203e-02  -2.027 0.042671 *  
## volatile.acidity     -1.416e+00  8.111e-02 -17.453  < 2e-16 ***
## citric.acid          -8.294e-02  8.792e-02  -0.943 0.345535    
## chlorides            -1.188e+00  3.476e-01  -3.418 0.000636 ***
## free.sulfur.dioxide   7.358e-03  8.860e-04   8.305  < 2e-16 ***
## total.sulfur.dioxide -1.787e-03  3.004e-04  -5.949 2.88e-09 ***
## density               2.269e+01  6.288e+00   3.608 0.000312 ***
## pH                   -2.265e-02  7.999e-02  -0.283 0.777047    
## sulphates             6.855e-01  7.964e-02   8.607  < 2e-16 ***
## alcohol               3.616e-01  1.376e-02  26.274  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.741 on 5183 degrees of freedom
## Multiple R-squared:  0.2997, Adjusted R-squared:  0.2983 
## F-statistic: 221.8 on 10 and 5183 DF,  p-value: < 2.2e-16

The model summary shows that some predictor variables (citric.acid and pH) are insignificant and that the adjusted R-squared is poor. Let’s continue inspecting our first model.

X.test.reg$predQuality <- predict(model.reg1,newdata=X.test.reg)
X.train.reg$predQuality <- predict(model.reg1,newdata=X.train.reg)
p<- ggplot(data=X.test.reg) +
  geom_point(aes(x=predQuality,y=quality),alpha=0.2,color="black") +
  
  geom_smooth(aes(x=predQuality,y=quality),color="black") +

  geom_line(aes(x=quality,y=quality),color="blue",linetype=2) +

  scale_x_continuous(limits=c(4,8)) +
  scale_y_continuous(limits=c(4,8)) +
  global_theme

p

The black line indicates the average relation between the predicted and actual quality and the blue line indicates the ideal relationship.

Improved Regressor

One way to improve our linear model is to check for outliers, in particular points with high leverage and high influence. We will use four diagnostic plots: Residuals vs Fitted, Normal Q-Q, Cook’s Distance, and Residuals vs Leverage.

par(mfrow=c(2,2))

plot(model.reg1, which = 1, cook.levels = c(0.05, 0.1)) %>% invisible()
plot(model.reg1, which = 2, cook.levels = c(0.05, 0.1)) %>% invisible()
plot(model.reg1, which = 4, cook.levels = c(0.05, 0.1)) %>% invisible()
plot(model.reg1, which = 5, cook.levels = c(0.05, 0.1)) %>% invisible()

  • The Residuals vs Fitted plot shows whether the residuals have non-linear patterns.
    • In this case, there are no non-linear relationships left unexplained by the model.
  • The Q-Q plot shows whether the residuals are normally distributed.
    • This plot is relatively linear, with a few off values at low theoretical quantiles.
    • off value indices: 4582, 3707, 2423
  • A few values fall outside of the Cook’s distance thresholds.
    • off value indices: 2143, 4582, 4955
  • The Residuals vs Leverage plot identifies influential cases, since not all outliers are influential in linear regression analysis.
    • off value indices: 2143, 4933, 4582

Let’s now remove these identified outliers and see if we have an improved model.

# OUTLIER IDX
to.rm <- c(4582,3707,2423,2143,4955,4933)

# REMOVE OUTLIERS
X.train.reg <- X.train.reg[-to.rm,]

Now let’s make another model.

model.reg2 <- lm(quality ~ fixed.acidity + volatile.acidity + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + sulphates + alcohol, 
                 data = X.train.reg)

summary(model.reg2)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + density + sulphates + 
##     alcohol, data = X.train.reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5283 -0.4647 -0.0314  0.4832  2.5708 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.922e+01  6.000e+00  -3.203 0.001369 ** 
## fixed.acidity        -2.630e-02  9.801e-03  -2.683 0.007309 ** 
## volatile.acidity     -1.385e+00  7.265e-02 -19.064  < 2e-16 ***
## chlorides            -1.233e+00  3.422e-01  -3.602 0.000319 ***
## free.sulfur.dioxide   7.368e-03  8.858e-04   8.318  < 2e-16 ***
## total.sulfur.dioxide -1.809e-03  2.874e-04  -6.294 3.36e-10 ***
## density               2.176e+01  5.993e+00   3.630 0.000286 ***
## sulphates             6.823e-01  7.819e-02   8.726  < 2e-16 ***
## alcohol               3.592e-01  1.333e-02  26.939  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.741 on 5179 degrees of freedom
## Multiple R-squared:  0.2996, Adjusted R-squared:  0.2985 
## F-statistic: 276.9 on 8 and 5179 DF,  p-value: < 2.2e-16

Let’s compare performance measures between both models to identify if there is any improvement.

reg.performance <- function(model, y_pred, y_true) {
     adj.r.sq <- summary(model)$adj.r.squared
     mse <- MSE(y_pred, y_true)
     rmse <- RMSE(y_pred, y_true)
     mae <- MAE(y_pred, y_true)
     print(paste0("Adjusted R-squared: ", round(adj.r.sq, 4)))
     print(paste0("MSE: ", round(mse, 4)))
     print(paste0("RMSE: ", round(rmse, 4)))
     print(paste0("MAE: ", round(mae, 4)))
}
# MODEL 1 PERFORMANCE
reg.performance(model = model.reg1, y_pred = model.reg1$fitted.values, y_true = X.train.reg$quality)
## [1] "Adjusted R-squared: 0.2983"
## [1] "MSE: 0.8319"
## [1] "RMSE: 0.9121"
## [1] "MAE: 0.7176"
# MODEL 2 PERFORMANCE
reg.performance(model = model.reg2, y_pred = model.reg2$fitted.values, y_true = X.train.reg$quality)
## [1] "Adjusted R-squared: 0.2985"
## [1] "MSE: 0.5481"
## [1] "RMSE: 0.7403"
## [1] "MAE: 0.5771"

The error metrics drop after removing the outliers, suggesting the second model fits the training data better. Note, however, that the adjusted R-squared barely changes between models, so the underlying fit remains poor. (Model 1’s metrics above were computed against the post-removal training set, so the comparison should be read with some caution.) Let’s finish this section by plotting the better model.

X.test.reg$predQuality <- predict(model.reg2,newdata=X.test.reg)
X.train.reg$predQuality <- predict(model.reg2,newdata=X.train.reg)
# MODEL 2 PERFORMANCE ON TRAIN
p1 <- ggplot(data=X.train.reg) +
  geom_point(aes(x=predQuality,y=quality),alpha=0.2,color="black") +
  
  geom_smooth(aes(x=predQuality,y=quality),color="black") +

  geom_line(aes(x=quality,y=quality),color="blue",linetype=2) +

  scale_x_continuous(limits=c(4,8)) +
  scale_y_continuous(limits=c(4,8)) +
  
  labs(title = "Train Data") +
  global_theme

# MODEL 2 PERFORMANCE ON TEST
p2 <- ggplot(data=X.test.reg) +
  geom_point(aes(x=predQuality,y=quality),alpha=0.2,color="black") +
  
  geom_smooth(aes(x=predQuality,y=quality),color="black") +

  geom_line(aes(x=quality,y=quality),color="blue",linetype=2) +

  scale_x_continuous(limits=c(4,8)) +
  scale_y_continuous(limits=c(4,8)) +
  
  labs(title = "Test Data") +
  global_theme

grid.arrange(p1, p2,  ncol=2)

Some final remarks

The linear regression model has been examined and improved, yet it remains ineffective at modelling the training dataset, as seen in the plot above. The best adjusted R-squared value produced is only \(0.2985\). This poor performance on the training set is reflected in the test set. It is evident that a linear regression model is not well suited to this data.

It was mentioned earlier that we perhaps care more about distinguishing a low-quality wine from a medium-quality one than a quality \(6\) from a quality \(7\) wine. Let’s have a look at the raw errors of our preferred model’s predictions. Remember, we round these predictions to the nearest whole number to give a categorical score between \(0-10\) as previously described.

# DISPLAY RAW ERRORS - AFTER ROUNDING PREDICTION
raw.errors <- round(model.reg2$residuals, 0)
qplot(raw.errors,geom="bar") +
  global_theme

It is evident that most of the mistakes the model makes are within a margin of \(1\); that is, most misclassifications are one quality point above or below the true value. The confidence interval of the approximate normal distribution verifies this conjecture.
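The `confidence_interval` helper used below is assumed to have been defined earlier; a minimal sketch consistent with its usage:

```r
# Normal-approximation interval: mu +/- z * sigma
confidence_interval <- function(mu, sigma, z) {
  c(mu - z * sigma, mu + z * sigma)
}
```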

mu <- mean(raw.errors)
sigma <- sd(raw.errors)
z <- 1.645 # 90% confidence interval

confidence_interval(mu, sigma, z)
## [1] -1.313802  1.345799

Whether this model is actually viable depends on the error tolerance we are willing to accept.


2. Clustering Task

We now wish to explore the relationships between different wine qualities of both red and white wine according to their alcohol content. We will rework the original dataset so that wine qualities are categorised into low, medium and high, and group these observations by color (e.g. “high.red” or “med.white”).

2.1 Data Preparation for Clustering

We are going to work with the original dataset.

head(wine)

Let’s first bin our quality variable into low, medium and high categories.

c.df <- wine

# REBIN QUALITY FEATURE (breaks fall at 0, 3, 6, 9; a score above 9 would become NA, but none occur)
c.df$quality.cat <- cut(c.df$quality, breaks = seq(0, 10, 3), labels = c("low", "medium", "high"))

str(c.df)
## tibble [6,492 × 16] (S3: tbl_df/tbl/data.frame)
##  $ fixed.acidity       : num [1:6492] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num [1:6492] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num [1:6492] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num [1:6492] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num [1:6492] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num [1:6492] 11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num [1:6492] 34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num [1:6492] 0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num [1:6492] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num [1:6492] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num [1:6492] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : num [1:6492] 5 5 5 6 5 5 5 7 7 5 ...
##  $ good                : num [1:6492] 0 0 0 0 0 0 0 1 1 0 ...
##  $ color               : Factor w/ 2 levels "red","white": 1 1 1 1 1 1 1 1 1 1 ...
##  $ quality.factor      : Factor w/ 7 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
##  $ quality.cat         : Factor w/ 3 levels "low","medium",..: 2 2 2 2 2 2 2 3 3 2 ...

Now let’s bind wine color to this newly created feature.

c.df <- c.df %>% mutate(quality.color = factor(paste0(quality.cat, ".", color)))

clust.df <- c.df %>% 
            group_by(quality.color) %>%
            subset( select = c(quality.color, alcohol, volatile.acidity) )

set.seed(42)
clust.df <- clust.df[sample(nrow(clust.df), 60), ]

It is important to notice that our data has differing units. For example, residual sugar is measured in \(\frac{g}{dm^3}\) while free sulfur dioxide is measured in \(\frac{mg}{dm^3}\). Disparity in units affects the clustering an algorithm will discover, so it is important to standardise our numerical variables to give all features equal weight. It could be interesting to convert all variables into SI units instead, but we will keep things simple and standardise.

We will use euclidean distance for our clustering.

vars.to.use <- colnames(clust.df[-1])
pmatrix <- scale(clust.df[, vars.to.use])

pcenter <- attr(pmatrix, "scaled:center")
pscale <- attr(pmatrix, "scaled:scale")

d <- dist(pmatrix, method = "euclidean")
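We keep `pcenter` and `pscale` so that scaled results can later be mapped back to the original units. A hypothetical helper (the defaults reference the globals stored above):

```r
# Reverse the scale() transformation using the stored
# centre and scale attributes.
unscale <- function(scaled, center = pcenter, scale = pscale) {
  scaled * scale + center
}
```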

2.2 Finding Optimal K

Before performing our clustering, we need to find the optimal value of k. The k value will determine the number of clusters that we model against and can be derived through the average distance from each point to the centre of its cluster and the average distance between clusters.

sqr_edist <- function(x, y) {
  sum((x - y) ^ 2)
}

wss.cluster <- function(clustermat) {
  c0 <- apply(clustermat, 2, FUN = mean)
  sum(apply(clustermat, 1, 
            FUN = function(row) {sqr_edist(row, c0)}))
}

wss.total <- function(dmatrix, labels) {
  wsstot <- 0
  k <- length(unique(labels))
  for(i in 1:k){
    wsstot <- wsstot + 
      wss.cluster(subset(dmatrix, labels == i))
  }
  wsstot
}

totss <- function(dmatrix) {
  grandmean <- apply(dmatrix, 2, FUN = mean)
  sum(apply(dmatrix, 1, 
            FUN=function(row) {
              sqr_edist(row, grandmean)
              }
            )
      )
}

The Calinski-Harabasz index (CH) is defined as the ratio of inter-cluster variance to the total intra-cluster variance and is a popular metric for selecting the optimal value for k.
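In symbols, with \(n\) points and \(k\) clusters, and writing BSS for the between-cluster sum of squares and WSS for the within-cluster sum of squares (matching the helper functions above):

\[
\mathrm{CH}(k) = \frac{\mathrm{BSS}(k) / (k - 1)}{\mathrm{WSS}(k) / (n - k)}
\]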

Let’s create a function to compute this metric.

rm(var) # remove any earlier object named 'var' so that stats::var is used below

ch_criterion <- function(dmatrix, kmax, method = "kmeans") {
  if(!(method %in% c("kmeans", "hclust"))){ 
    stop("method must be one of c('kmeans', 'hclust')")
  }
  npts <- nrow(dmatrix)
  totss <- totss(dmatrix)
  wss <- numeric(kmax)
  crit <- numeric(kmax)
  wss[1] <- (npts - 1) * sum(apply(dmatrix, 2, FUN = var))
  
  for(k in 2:kmax) {
    if(method == "kmeans") {
      clustering <- kmeans(dmatrix, k, nstart = 10, iter.max = 100)
      wss[k] <- clustering$tot.withinss
    }else {
      d <- dist(dmatrix, method = "euclidean")
      pfit <- hclust(d, method = "ward.D")  # "ward" was renamed to "ward.D" in modern R
      labels <- cutree(pfit, k = k)
      wss[k] <- wss.total(dmatrix, labels)
    }
  }
  bss <- totss - wss
  crit.num <- bss / (0:(kmax - 1))      # between-cluster variance per cluster (NaN at k = 1)
  crit.denom <- wss / (npts - 1:kmax)   # within-cluster variance per point
  list(crit = crit.num / crit.denom, wss = wss, bss = bss, totss = totss)
}

Now let’s plot the CH index over a range of different k values.

clustcrit <- ch_criterion(pmatrix, 12, method = "hclust")
critframe <- data.frame(k = 1:12, ch = scale(clustcrit$crit), wss = scale(clustcrit$wss), bss = scale(clustcrit$bss))
critframe <- melt(critframe, id.vars = c("k"), variable.name = "measure", value.name = "score")
ggplot(critframe, aes(x = k, y = score, color = measure)) +
    geom_point(aes(shape = measure), size = 2) + 
    geom_line(aes(linetype = measure), size = 1) +
    scale_x_continuous(breaks = 1:20, labels = 1:20) +
    global_theme

Typically, we look for the point where the benefit of additional clusters slows down, visible as a distinct ‘elbow’ in the plot above, and select it as the value of k. Three or four look promising. Let’s compute the best value using R: NbClust reports the optimal value for k across \(30\) different clustering indexes.

This algorithm has poor time complexity. One way to work around this is to run it on a representative sample of the dataset; we repeat the algorithm on several different samples so that we do not miss any fine detail in the population.

# FIND OPTIMAL K USING NbClust
library(NbClust)
library(factoextra)

n <- 100
epoch <- 5
kVals <- c()


# CREATE epoch DIFFERENT SAMPLES TO SEE IF A CONSENSUS IS REACHED ON k
for (i in 1:epoch) {
  
  smple_ <- sample_n(clust.df, n, replace = TRUE)
  k.select_ <- smple_ %>% 
    subset(select = -c(quality.color)) %>% 
    NbClust(method = "ward.D2", index = "all", min.nc = 2, max.nc = 15)
  
  kVals <- c(kVals, length(unique(k.select_$Best.partition)))
  
} 

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 10 proposed 3 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 3 proposed 13 as the best number of clusters 
## * 5 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 12 proposed 3 as the best number of clusters 
## * 2 proposed 9 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 4 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 10 proposed 3 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 1 proposed 11 as the best number of clusters 
## * 2 proposed 13 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 4 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 11 proposed 3 as the best number of clusters 
## * 2 proposed 9 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 4 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 12 proposed 3 as the best number of clusters 
## * 2 proposed 13 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 4 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
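For reproducibility, a seed can be fixed before the resampling loop. A minimal base-R sketch of the resample-and-tally pattern used above (with a stub `choose_k` standing in for the NbClust call, and a toy data frame standing in for `clust.df` — both are assumptions for illustration only) is:

```r
set.seed(2021)  # fix the RNG so the epoch samples are reproducible

# Stub standing in for the NbClust-based selection above; in the real
# analysis this would run NbClust and count the clusters in Best.partition.
choose_k <- function(smple) 3L

toy.df <- data.frame(x = rnorm(200), y = rnorm(200))  # toy stand-in for clust.df

# One bootstrap sample per epoch; record the k chosen for each
kVals <- vapply(1:5, function(i) {
  smple <- toy.df[sample(nrow(toy.df), 100, replace = TRUE), ]
  choose_k(smple)
}, integer(1))
kVals
```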

Let’s view the results from the final sample created in the loop above.

# FIND NbClust BEST NUMBER OF CLUSTERS
fviz_nbclust(k.select_, barfill = "#00BFC4", barcolor = "#00BFC4") + 
    theme_gray() +
    ggtitle("NbClust's optimal number of clusters")
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 4 proposed  2 as the best number of clusters
## * 12 proposed  3 as the best number of clusters
## * 2 proposed  13 as the best number of clusters
## * 1 proposed  14 as the best number of clusters
## * 4 proposed  15 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

We can see that, for every sample, the majority of clustering indices selected \(3\) as the optimal value of k. The vector below summarises the k chosen in each epoch.

kVals
## [1] 3 3 3 3 3
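The same majority rule that NbClust applies within a sample can be applied across epochs. A small helper (hypothetical, not part of NbClust) that returns the modal k from `kVals`:

```r
# Majority vote over the k values collected per epoch (hypothetical helper).
consensus_k <- function(kVals) {
  tab <- table(kVals)                     # tally each candidate k
  as.integer(names(tab)[which.max(tab)])  # return the most frequent one
}

consensus_k(c(3, 3, 3, 3, 3))  # → 3
```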

Now let’s check the stability of these clusters.

# BEST k
kbest.p <- 3

# RUNNING CLUSTERBOOT (clusterboot() and hclustCBI() come from the fpc package)
library(fpc)
cboot.hclust <- clusterboot(pmatrix,
                            clustermethod = hclustCBI,
                            method = "ward.D",
                            k = kbest.p,
                            count = FALSE)

# CHECK STABILITY: extract the cluster assignments, then compute the
# proportion of bootstrap runs in which each cluster was not dissolved
groups <- cboot.hclust$result$partition

1 - cboot.hclust$bootbrd / 100
## [1] 0.97 1.00 0.84
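The stability scores above are derived from `bootbrd`, the number of bootstrap runs (out of clusterboot's default of \(100\)) in which each cluster was "dissolved". A base-R sketch of that conversion, using hypothetical dissolution counts:

```r
B <- 100                # number of bootstrap resamples (clusterboot's default)
bootbrd <- c(3, 0, 16)  # hypothetical dissolution counts per cluster

# Stability = share of resamples in which the cluster survived intact
stability <- 1 - bootbrd / B
stability  # → 0.97 1.00 0.84
```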

The first two clusters are highly stable (\(0.97\) and \(1.00\)) and the third is reasonably stable (\(0.84\)), so we can be confident that \(3\) is a sensible value of k.

2.3 Hierarchical Clustering Method

The clustering methodology we will use is hierarchical clustering. The dendrogram below displays each observation in our sample, labelled by its quality/color group, within its respective cluster.

# CONSTRUCT DENDROGRAM
pfit <- hclust(d, method="ward.D2")
plot(pfit, labels = clust.df$quality.color)
rect.hclust(pfit, k = kbest.p)

Let’s plot the clusters for better visualisation using principal component transformation.

# PRINCIPAL COMPONENT TRANSFORMATION
groups <- cutree(pfit, k = kbest.p)
princ <- prcomp(pmatrix)
nComp <- 2
project <- as.data.frame(predict(princ, newdata=pmatrix)[,1:nComp])
project.plus <- cbind(project, cluster=as.factor(groups), Group=clust.df$quality.color)

h <- do.call(
    rbind, 
    lapply(
        unique(groups),
        function(c) { 
            f <- subset(project.plus, cluster == c); 
            f[chull(f), ]
        }
    )
)
# VISUALISE RESULTS

ggplot(project.plus, aes(x = PC1, y = PC2)) +
    geom_point(aes(shape = cluster, color = cluster)) +
    geom_text(aes(label = Group, color = cluster),
              hjust = 0, vjust = 1, size = 3) +
    geom_polygon(data = h, aes(group = cluster, fill = as.factor(cluster)),
                 alpha = 0.4, linetype = 0) +
    theme(legend.position = "none") +
    global_theme

The plot above shows three distinct clusters with no overlap. Interestingly, the wine quality/color groups cluster together to some degree: the blue cluster consists almost entirely of quality \(5\) wines, the green cluster contains mostly high-quality white wines, and the red cluster holds many mid- to high-quality wines that are predominantly white. This suggests that a relationship does exist between these groups and the alcohol and volatile acidity content of a wine.
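These composition claims can be quantified with a cross-tabulation of cluster membership against the quality/color label; in the analysis above this would be `table(groups, clust.df$quality.color)`. A base-R sketch with hypothetical toy data:

```r
# Hypothetical stand-ins for the cluster vector and quality/color labels.
groups <- c(1, 1, 1, 2, 2, 3, 3)
labels <- c("5/red", "5/white", "5/white", "8/white", "7/white", "6/white", "6/red")

# Rows = clusters, columns = quality/color groups; a row concentrated in
# few columns indicates a cluster dominated by those groups.
table(cluster = groups, group = labels)
```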

It should be mentioned that a small representative sample of \(50\) instances was used to produce the clusters above. I used different seed values to produce clusters from other representative samples and can conclude that the result above generalises well and is representative of the entire data set. This experimentation is omitted from this final report.


Conclusion

In this exploration and analysis of the Wine Quality data set we faced a number of challenges.

These challenges reinforce the need for care when modelling data, especially when the resulting solutions affect the domain in some way. Our analysis did uncover valuable relationships in the chemistry of wine; for example, higher-quality wines tend to have a higher alcohol content. This makes me question how the quality of wine is assessed: is there a bias in how alcohol affects the decision of a wine being good or bad? The analysis also shows that not all chemical properties are valuable, since feature selection techniques give similar or better results.

It would be interesting to further investigate which combination of chemical properties creates the best wines. This could aid wine makers in raising the quality of their wines.